Install the maftools package
if (!require("BiocManager"))
install.packages("BiocManager")
Loading required package: BiocManager
Bioconductor version 3.10 (BiocManager 1.30.10), ?BiocManager::install for help
BiocManager::install("maftools")
Bioconductor version 3.10 (BiocManager 1.30.10), R 3.6.1 (2019-07-05)
Installing package(s) 'maftools'
The downloaded binary packages are in
/var/folders/sm/9snzbscd22s_h2jyvwy6y6sm0000gn/T//RtmpQUijGK/downloaded_packages
Old packages: 'nlme'
library(maftools)
laml.maf = system.file('extdata', 'tcga_laml.maf.gz', package = 'maftools')
laml.clin = system.file('extdata', 'tcga_laml_annot.tsv', package = 'maftools')
laml = read.maf(maf = laml.maf, clinicalData = laml.clin)
-Reading
-Validating
-Silent variants: 475
-Summarizing
-Processing clinical data
-Finished in 0.463s elapsed (0.822s cpu)
laml
An object of class MAF
ID summary Mean Median
1: NCBI_Build 37 NA NA
2: Center genome.wustl.edu NA NA
3: Samples 193 NA NA
4: nGenes 1241 NA NA
5: Frame_Shift_Del 52 0.271 0
6: Frame_Shift_Ins 91 0.474 0
7: In_Frame_Del 10 0.052 0
8: In_Frame_Ins 42 0.219 0
9: Missense_Mutation 1342 6.990 7
10: Nonsense_Mutation 103 0.536 0
11: Splice_Site 92 0.479 0
12: total 1732 9.021 9
getSampleSummary(laml)
getFields(laml)
[1] "Hugo_Symbol" "Entrez_Gene_Id" "Center"
[4] "NCBI_Build" "Chromosome" "Start_Position"
[7] "End_Position" "Strand" "Variant_Classification"
[10] "Variant_Type" "Reference_Allele" "Tumor_Seq_Allele1"
[13] "Tumor_Seq_Allele2" "Tumor_Sample_Barcode" "Protein_Change"
[16] "i_TumorVAF_WU" "i_transcript_name"
Hugo_Symbol
Entrez_Gene_Id
Center
NCBI_Build
Chromosome
Start_Position
End_Position
Strand
Variant_Classification
Variant_Type
Reference_Allele
Tumor_Seq_Allele1
Tumor_Seq_Allele2
Tumor_Sample_Barcode
Protein_Change
i_TumorVAF_WU
i_transcript_name
getClinicalData(laml)
getGeneSummary(laml)
write.mafSummary(maf = laml, basename = 'laml')
plotmafSummary(maf = laml, rmOutlier = TRUE, addStat = 'median', dashboard = TRUE, titvRaw = FALSE)
oncoplot(maf = laml, top = 10)
oncostrip(maf = laml, genes = c('DNMT3A','NPM1', 'RUNX1'))
titv function classifies SNPs into Transitions and Transversions and returns a list of summarized tables in various ways. Summarized data can also be visualized as a boxplot showing overall distribution of six different conversions and as a stacked barplot showing fraction of conversions in each sample.
laml.titv = titv(maf = laml, plot = FALSE, useSyn = TRUE)
#plot titv summary
plotTiTv(res = laml.titv)
Showing mutation spots on protein structure with lollipop plots. It will show the most mutant spot.
#lollipop plot for DNMT3A, which is one of the most frequent mutated gene in Leukemia.
lollipopPlot(maf = laml, gene = 'DNMT3A', AACol = 'Protein_Change', showMutationRate = TRUE)
3 transcripts available. Use arguments refSeqID or proteinID to manually specify tx name.
HGNC refseq.ID protein.ID aa.length
1: DNMT3A NM_175629 NP_783328 912
2: DNMT3A NM_022552 NP_072046 912
3: DNMT3A NM_153759 NP_715640 723
Using longer transcript NM_175629 for now.
Removed 3 mutations for which AA position was not available
lollipopPlot(maf = laml, gene = 'KIT', AACol = 'Protein_Change', labelPos = 816, refSeqID = 'NM_000222')
Rainfall plot also highlights regions where potential changes in inter-event distances are located.
brca <- system.file("extdata", "brca.maf.gz", package = "maftools")
brca = read.maf(maf = brca, verbose = FALSE)
rainfallPlot(maf = brca, detectChangePoints = TRUE, pointSize = 0.6)
Processing TCGA-A8-A08B..
Kataegis detected at:
Chromosome Start_Position End_Position nMuts Avg_intermutation_dist Size
1: 8 98129391 98133560 6 833.8000 4169
2: 8 98398603 98403536 8 704.7143 4933
3: 8 98453111 98456466 8 479.2857 3355
4: 8 124090506 124096810 21 315.2000 6304
5: 12 97437934 97439705 6 354.2000 1771
6: 17 29332130 29336153 7 670.5000 4023
Tumor_Sample_Barcode C>G C>T
1: TCGA-A8-A08B 4 2
2: TCGA-A8-A08B 1 7
3: TCGA-A8-A08B NA 8
4: TCGA-A8-A08B 1 20
5: TCGA-A8-A08B 3 3
6: TCGA-A8-A08B 4 3
TCGA contains over 30 different cancer cohorts and median mutation load across them varies from as low as 7 per exome to as high as 315 per exome. It is informative to see how mutation load in given maf stands against TCGA cohorts.
laml.mutload = tcgaCompare(maf = laml, cohortName = 'Example-LAML')
plotVaf(maf = laml, vafCol = 'i_TumorVAF_WU')
Plot word cloud plot for mutated genes with the function geneCloud
geneCloud(input = laml, minMut = 3)
Genes occurs mutated in cancer are co-occuring in their mutation pattern. using somaticInteractions to detect this relationship.
somaticInteractions(maf = laml, top = 25, pvalue = c(0.05, 0.1))
Detecting cancer dirver genes using oncodrive function that using oncodriveCLUST algorithm.
laml.sig = oncodrive(maf = laml, AACol = 'Protein_Change', minMut = 5, pvalMethod = 'zscore')
Estimating background scores from synonymous variants..
Not enough genes to build background. Using predefined values. (Mean = 0.279; SD = 0.13)
Estimating cluster scores from non-syn variants..
|
| | 0%
|
|=== | 4%
|
|====== | 9%
|
|========= | 13%
|
|============ | 17%
|
|=============== | 22%
|
|================== | 26%
|
|===================== | 30%
|
|======================== | 35%
|
|=========================== | 39%
|
|============================== | 43%
|
|================================= | 48%
|
|===================================== | 52%
|
|======================================== | 57%
|
|=========================================== | 61%
|
|============================================== | 65%
|
|================================================= | 70%
|
|==================================================== | 74%
|
|======================================================= | 78%
|
|========================================================== | 83%
|
|============================================================= | 87%
|
|================================================================ | 91%
|
|=================================================================== | 96%
|
|======================================================================| 100%
Comapring with background model and estimating p-values..
Done !
head(laml.sig)
plotOncodrive(res = laml.sig, fdrCutOff = 0.1, useFraction = TRUE)
Adds pfam domain information to the amino acid changes
laml.pfam = pfamDomains(maf = laml, AACol = 'Protein_Change', top = 10)
Warning in pfamDomains(maf = laml, AACol = "Protein_Change", top = 10): Removed
50 mutations for which AA position was not available
laml.pfam$proteinSummary[,1:7, with = FALSE]
laml.pfam$domainSummary[,1:3, with = FALSE]
#MutsigCV results for TCGA-AML
laml.mutsig <- system.file("extdata", "LAML_sig_genes.txt.gz", package = "maftools")
pancanComparison(mutsigResults = laml.mutsig, qval = 0.1, cohortName = 'LAML', inputSampleSize = 200, label = 1)
Significantly mutated genes in LAML (q < 0.1): 23
Significantly mutated genes in PanCan cohort (q <0.1): 114
Significantly mutated genes exclusive to LAML (q < 0.1):
gene pancan q nMut log_q_pancan log_q
1: CEBPA 1.000 3.500301e-12 13 0.00000000 11.455895
2: EZH2 1.000 7.463546e-05 3 0.00000000 4.127055
3: GIGYF2 1.000 6.378338e-03 2 0.00000000 2.195292
4: KIT 0.509 1.137517e-05 8 0.29328222 4.944042
5: PHF6 0.783 6.457555e-09 6 0.10623824 8.189932
6: PTPN11 0.286 7.664584e-03 9 0.54363397 2.115511
7: RAD21 0.929 1.137517e-05 5 0.03198429 4.944042
8: SMC1A 0.801 2.961696e-03 6 0.09636748 2.528460
9: TET2 0.907 2.281625e-13 17 0.04239271 12.641756
10: WT1 1.000 2.281625e-13 12 0.00000000 12.641756
#Survival analysis based on grouping of DNMT3A mutation status
mafSurvival(maf = laml, genes = 'DNMT3A', time = 'days_to_last_followup', Status = 'Overall_Survival_Status', isTCGA = TRUE)
Looking for clinical data in annoatation slot of MAF..
Number of mutated samples for given genes:
DNMT3A
48
Removed 11 samples with NA's
Median survival..
Group medianTime N
1: Mutant 245 45
2: WT 396 137
Identify set of genes which results in poor survival
prog_geneset = survGroup(maf = laml, top = 20, geneSetSize = 2, time = "days_to_last_followup", Status = "Overall_Survival_Status", verbose = FALSE)
Removed 11 samples with NA's
print(prog_geneset)
Gene_combination P_value hr WT Mutant
1: FLT3_DNMT3A 0.00104 2.510 164 18
2: DNMT3A_SMC3 0.04880 2.220 176 6
3: DNMT3A_NPM1 0.07190 1.720 166 16
4: DNMT3A_TET2 0.19600 1.780 176 6
5: FLT3_TET2 0.20700 1.860 177 5
6: NPM1_IDH1 0.21900 0.495 176 6
7: DNMT3A_IDH1 0.29300 1.500 173 9
8: IDH2_RUNX1 0.31800 1.580 176 6
9: FLT3_NPM1 0.53600 1.210 165 17
10: DNMT3A_IDH2 0.68000 0.747 178 4
11: DNMT3A_NRAS 0.99200 0.986 178 4
mafSurvGroup(maf = laml, geneSet = c("DNMT3A", "FLT3"), time = "days_to_last_followup", Status = "Overall_Survival_Status")
Looking for clinical data in annoatation slot of MAF..
Removed 11 samples with NA's
Median survival..
Group medianTime N
1: Mutant 242.5 18
2: WT 379.5 164
#Primary APL MAF
primary.apl = system.file("extdata", "APL_primary.maf.gz", package = "maftools")
primary.apl = read.maf(maf = primary.apl)
-Reading
-Validating
--Non MAF specific values in Variant_Classification column:
ITD
-Silent variants: 45
-Summarizing
-Processing clinical data
--Missing clinical data
-Finished in 0.115s elapsed (0.135s cpu)
#Relapse APL MAF
relapse.apl = system.file("extdata", "APL_relapse.maf.gz", package = "maftools")
relapse.apl = read.maf(maf = relapse.apl)
-Reading
-Validating
--Non MAF specific values in Variant_Classification column:
ITD
-Silent variants: 19
-Summarizing
-Processing clinical data
--Missing clinical data
-Finished in 0.082s elapsed (0.144s cpu)
compare two different type of cohorts to detect differnet mutations.
pt.vs.rt <- mafCompare(m1 = primary.apl, m2 = relapse.apl, m1Name = 'Primary', m2Name = 'Relapse', minMut = 5)
print(pt.vs.rt)
$results
Hugo_Symbol Primary Relapse pval or ci.up ci.low
1: PML 1 11 1.529935e-05 0.03537381 0.2552937 0.000806034
2: RARA 0 7 2.574810e-04 0.00000000 0.3006159 0.000000000
3: RUNX1 1 5 1.310500e-02 0.08740567 0.8076265 0.001813280
4: FLT3 26 4 1.812779e-02 3.56086275 14.7701728 1.149009169
5: ARID1B 5 8 2.758396e-02 0.26480490 0.9698686 0.064804160
6: WT1 20 14 2.229087e-01 0.60619329 1.4223101 0.263440988
7: KRAS 6 1 4.334067e-01 2.88486293 135.5393108 0.337679367
8: NRAS 15 4 4.353567e-01 1.85209500 8.0373994 0.553883512
9: ARID1A 7 4 7.457274e-01 0.80869223 3.9297309 0.195710173
adjPval
1: 0.0001376942
2: 0.0011586643
3: 0.0393149868
4: 0.0407875250
5: 0.0496511201
6: 0.3343630535
7: 0.4897762916
8: 0.4897762916
9: 0.7457273717
$SampleSummary
Cohort SampleSize
1: Primary 124
2: Relapse 58
forestPlot(mafCompareRes = pt.vs.rt, pVal = 0.1, color = c('royalblue', 'maroon'), geneFontSize = 0.8)
genes = c("PML", "RARA", "RUNX1", "ARID1B", "FLT3")
coOncoplot(m1 = primary.apl, m2 = relapse.apl, m1Name = 'PrimaryAPL', m2Name = 'RelapseAPL', genes = genes, removeNonMutated = TRUE)
lollipopPlot2(m1 = primary.apl, m2 = relapse.apl, gene = "PML", AACol1 = "amino_acid_change", AACol2 = "amino_acid_change", m1_name = "Primary", m2_name = "Relapse")
9 transcripts available. Use arguments refSeqID or proteinID to manually specify tx name.
HGNC refseq.ID protein.ID aa.length
1: PML NM_033238 NP_150241 882
2: PML NM_002675 NP_002666 633
3: PML NM_033249 NP_150252 585
4: PML NM_033247 NP_150250 435
5: PML NM_033239 NP_150242 829
6: PML NM_033250 NP_150253 781
7: PML NM_033240 NP_150243 611
8: PML NM_033244 NP_150247 560
9: PML NM_033246 NP_150249 423
Using longer transcript NM_033238 for now.
9 transcripts available. Use arguments refSeqID or proteinID to manually specify tx name.
HGNC refseq.ID protein.ID aa.length
1: PML NM_033238 NP_150241 882
2: PML NM_002675 NP_002666 633
3: PML NM_033249 NP_150252 585
4: PML NM_033247 NP_150250 435
5: PML NM_033239 NP_150242 829
6: PML NM_033250 NP_150253 781
7: PML NM_033240 NP_150243 611
8: PML NM_033244 NP_150247 560
9: PML NM_033246 NP_150249 423
Using longer transcript NM_033238 for now.
clinicalEnrichment is another function which takes any clinical feature associated with the samples and performs enrichment analysis. It performs various groupwise and pairwise comparisions to identify enriched mutations for every category within a clincila feature. Below is an example to identify mutations associated with FAB_classification.
fab.ce = clinicalEnrichment(maf = laml, clinicalFeature = 'FAB_classification')
Sample size per factor in FAB_classification:
M0 M1 M2 M3 M4 M5 M6 M7
19 44 44 21 39 19 3 3
plotEnrichmentResults(enrich_res = fab.ce, pVal = 0.05)
Check the interaction from the Drug Gene Interaction database
dgi = drugInteractions(maf = laml, fontSize = 0.75)
dnmt3a.dgi = drugInteractions(genes = "DNMT3A", drugs = TRUE)
Number of claimed drugs for given genes:
Gene N
1: DNMT3A 7
Find the oncogenic signaling pathways in TCGA cohorts.
OncogenicPathways(maf = laml)
Pathway alteration fractions
Pathway N n_affected_genes fraction_affected
1: RTK-RAS 85 18 0.21176471
2: Hippo 38 7 0.18421053
3: NOTCH 71 6 0.08450704
4: MYC 13 3 0.23076923
5: WNT 68 3 0.04411765
6: TP53 6 2 0.33333333
7: NRF2 3 1 0.33333333
8: PI3K 29 1 0.03448276
9: Cell_Cycle 15 0 0.00000000
10: TGF-Beta 7 0 0.00000000
PlotOncogenicPathways(maf = laml, pathways = "RTK-RAS")
Each cancer has its own mutation just like our signatures. So that we could make a signature characterized by specific pattern of nucleotide substitutions.
library(BSgenome.Hsapiens.UCSC.hg19, quietly = TRUE)
Attaching package: 'BiocGenerics'
The following objects are masked from 'package:parallel':
clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
clusterExport, clusterMap, parApply, parCapply, parLapply,
parLapplyLB, parRapply, parSapply, parSapplyLB
The following objects are masked from 'package:stats':
IQR, mad, sd, var, xtabs
The following objects are masked from 'package:base':
Filter, Find, Map, Position, Reduce, anyDuplicated, append,
as.data.frame, basename, cbind, colnames, dirname, do.call,
duplicated, eval, evalq, get, grep, grepl, intersect, is.unsorted,
lapply, mapply, match, mget, order, paste, pmax, pmax.int, pmin,
pmin.int, rank, rbind, rownames, sapply, setdiff, sort, table,
tapply, union, unique, unsplit, which, which.max, which.min
Warning: package 'S4Vectors' was built under R version 3.6.3
Attaching package: 'S4Vectors'
The following object is masked from 'package:base':
expand.grid
Attaching package: 'Biostrings'
The following object is masked from 'package:base':
strsplit
laml.tnm = trinucleotideMatrix(maf = laml, prefix = 'chr', add = TRUE, ref_genome = "BSgenome.Hsapiens.UCSC.hg19")
Warning in trinucleotideMatrix(maf = laml, prefix = "chr", add = TRUE, ref_genome = "BSgenome.Hsapiens.UCSC.hg19"): Chromosome names in MAF must match chromosome names in reference genome.
Ignorinig 101 single nucleotide variants from missing chromosomes chr23
-Extracting 5' and 3' adjacent bases
-Extracting +/- 20bp around mutated bases for background C>T estimation
-Estimating APOBEC enrichment scores
--Performing one-way Fisher's test for APOBEC enrichment
---APOBEC related mutations are enriched in 3.315 % of samples (APOBEC enrichment score > 2 ; 6 of 181 samples)
-Creating mutation matrix
--matrix of dimension 188x96
plotApobecDiff(tnm = laml.tnm, maf = laml, pVal = 0.2)
$results
Hugo_Symbol Enriched nonEnriched pval or ci.up
1: TP53 2 13 0.08175632 5.9976455 46.608861
2: TET2 1 16 0.45739351 1.9407002 18.983979
3: FLT3 2 45 0.65523131 1.4081851 10.211621
4: DNMT3A 1 47 1.00000000 0.5335362 4.949499
5: ADAM11 0 2 1.00000000 0.0000000 164.191472
---
132: WAC 0 2 1.00000000 0.0000000 164.191472
133: WT1 0 12 1.00000000 0.0000000 12.690862
134: ZBTB33 0 2 1.00000000 0.0000000 164.191472
135: ZC3H18 0 2 1.00000000 0.0000000 164.191472
136: ZNF687 0 2 1.00000000 0.0000000 164.191472
ci.low adjPval
1: 0.49875432 1
2: 0.03882963 1
3: 0.12341748 1
4: 0.01101929 1
5: 0.00000000 1
---
132: 0.00000000 1
133: 0.00000000 1
134: 0.00000000 1
135: 0.00000000 1
136: 0.00000000 1
$SampleSummary
Cohort SampleSize Mean Median
1: Enriched 6 7.167 6.5
2: nonEnriched 172 9.715 9.0
Signature analysis includes following steps.
library('NMF')
Loading required package: pkgmaker
Loading required package: registry
Attaching package: 'pkgmaker'
The following object is masked from 'package:S4Vectors':
new2
Loading required package: rngtools
Loading required package: cluster
NMF - BioConductor layer [OK] | Shared memory capabilities [NO: bigmemory] | Cores 3/4
To enable shared memory capabilities, try: install.extras('
NMF
')
Attaching package: 'NMF'
The following object is masked from 'package:S4Vectors':
nrun
laml.sign = estimateSignatures(mat = laml.tnm, nTry = 6, pConstant = 0.1)
-Running NMF for 6 ranks
Compute NMF rank= 2 ... + measures ... OK
Compute NMF rank= 3 ... + measures ... OK
Compute NMF rank= 4 ... + measures ... OK
Compute NMF rank= 5 ... + measures ... OK
Compute NMF rank= 6 ... + measures ... OK
-Finished in 00:01:55 elapsed (14.8s cpu)
plotCophenetic(res = laml.sign)
laml.sig = extractSignatures(mat = laml.tnm, n = 3, pConstant = 0.1)
-Running NMF for factorization rank: 3
-Finished in2.494s elapsed (2.221s cpu)
laml.og30.cosm = compareSignatures(nmfRes = laml.sig, sig_db = "legacy")
-Comparing against COSMIC signatures
------------------------------------
--Found Signature_1 most similar to COSMIC_1
Aetiology: spontaneous deamination of 5-methylcytosine [cosine-similarity: 0.84]
--Found Signature_2 most similar to COSMIC_1
Aetiology: spontaneous deamination of 5-methylcytosine [cosine-similarity: 0.577]
--Found Signature_3 most similar to COSMIC_5
Aetiology: Unknown [cosine-similarity: 0.851]
------------------------------------
#Compate against updated version3 60 signatures
laml.v3.cosm = compareSignatures(nmfRes = laml.sig, sig_db = "SBS")
-Comparing against COSMIC signatures
------------------------------------
--Found Signature_1 most similar to SBS1
Aetiology: spontaneous or enzymatic deamination of 5-methylcytosine [cosine-similarity: 0.858]
--Found Signature_2 most similar to SBS6
Aetiology: defective DNA mismatch repair [cosine-similarity: 0.538]
--Found Signature_3 most similar to SBS3
Aetiology: Defects in DNA-DSB repair by HR [cosine-similarity: 0.836]
------------------------------------
library('pheatmap')
pheatmap::pheatmap(mat = laml.og30.cosm$cosine_similarities, cluster_rows = FALSE, main = "cosine similarity against validated signatures")
maftools::plotSignatures(nmfRes = laml.sig, title_size = 0.8)
Signatures can further be assigned to samples and enrichment analysis can be performd using signatureEnrichment funtion, which identifies mutations enriched in every signature identified.
laml.se = signatureEnrichment(maf = laml, sig_res = laml.sig)
Running k-means for signature assignment..
Performing pairwise and groupwise comparisions..
Sample size per factor in Signature:
Signature_1 Signature_2 Signature_3
60 65 63
Estimating mutation load and signature exposures..
plotEnrichmentResults(enrich_res = laml.se, pVal = 0.05)